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Abstract 



o 

The /c-nearest-neighbour procedure is a well-known deterministic method used in 
j ; supervised classification. This paper proposes a reassessment of this approach as a 

C$ statistical technique derived from a proper probabilistic model; in particular, we mod- 

c/3 ify the assessment made in a previous analysis of this method undertaken by |Holmes| 

and Adams (2002 2003 1, and evaluated bylManocha and Girolami (20071, where the 



underlying probabilistic model is not completely well-defined. Once a clear proba- 
bilistic basis for the fc-nearest-ncighbour procedure is established, we derive computa- 
t tional tools for conducting Bayesian inference on the parameters of the corresponding 

*^ model. In particular, we assess the difficulties inherent to pseudo-likelihood and to 

path sampling approximations of an intractable normalising constant, and propose 
a perfect sampling strategy to implement a correct MCMC sampler associated with 
our model. If perfect sampling is not available, we suggest using a Gibbs sampling 
approximation. Illustrations of the performance of the corresponding Bayesian clas- 
sifier are provided for several benchmark datasets, demonstrating in particular the 
r* limitations of the pseudo-likelihood approximation in this set-up. 

Keywords: Bayesian inference, classification, compatible conditionals, Boltzmann 
5-j model, normalising constant, pseudo- likelihood, path sampling, perfect sampling, 

MCMC algorithm. 



1 Introduction 

1.1 Deterministic versus statistical classification 

Supervised classification has long been used in both Machine Learning and Statistics 
to infer about the functional connection between a group of covariates (or explanatory 



1 



variables) and a vector of indicators (or classes) (see, e.g., McLachlan 1992 Ripley 1994 



1996; Devroye et al. 



(Freund and Schapire 



1996 



Hastie et al. 2001| ). For instance, the method of boosting 



1997) has been developed for this very purpose by the Machine 



Learning community and has also been assessed and extended by statisticians (Hastie 



, 2002 


2003 


Biihlmann 


2004 



The /c-nearest-neighbour method is a well-established and straightforward technique 



in this area with both a long past and a fairly resilient resistance to change (Ripley, 1994 



1996). Nonetheless, while providing an instrument for classifying points into two or more 
classes, it lacks a corresponding assessment of its classification error. While alternative 
techniques like boosting offer this assessment, it is obviously of interest to provide the 
original £;-nearest-neighbour method with this additional feature. In contrast, statistical 
classification methods that are based on a model such a mixture of distributions do pro- 
vide an assessment of error along with the most likely classification. This more global 
perspective thus requires the technique to be embedded within a probabilistic frame- 



work in order to give a proper meaning to the notion of classification error. Holmes and 



Adams (2002) propose a Bayesian analysis of the /c-nearest-neighbour-method based on 



these premises, and we refer the reader to this paper for background and references. In 
a separate paper, Holmes and Adams (2003) defined another model based on autologis- 
tic representations and conducted a likelihood analysis of this model, in particular for 
selecting the value of k. While we also adopt a Bayesian approach, our paper differs 
from Holmes and Adams (2002) in two important respects: first, we define a global prob- 
abilistic model that encapsulates the fc-nearest-neighbour method, rather than working 
with incompatible conditional distributions, and, second, we derive a fully operational 
simulation technique adapted to our model and based either on perfect sampling or on 
a Gibbs sampling approximation, that allows for a reassessment of the pseudo-likelihood 
approximation often used in those settings. 



1.2 The original A>nearest-neighbour method 

Given a training set of individuals allocated each to one of G classes, the classical k- 
nearest-neighbour procedure is a method that allocates new individuals to the most com- 
mon class in their neighbourhood among the training set, the neighbourhood being defined 
in terms of the covariates. More formally, based on a training dataset ((yi, Xi))™ =1 , where 
Hi £ {1, . . . , G} denotes the class label of the ith point and x% £ W 3 is a vector of covari- 
ates, an unobserved class y n +i associated with a new set of covariates x n+ \ is estimated 
by the most common class among the k nearest neighbours of x n +\ in the training set 
(xi)JLj. The neighbourhood is defined in the space of the covariates Xi, namely 

Afn+l = I 1 < * < n 'i d(xi,X n+1 ) < d(;X n+ i)( k )} , 
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where d(-,x n+ i) denotes the vector of distances to x n+ \ and d(-, x n+ \)^ denotes the 
fcth order statistic. The original A;- nearest- neighbour method usually uses the Euclidean 
norm, even though the Mahalanobis distance would be more appropriate in that it rescales 
the covariates. Whenever ties occur, they are resolved by decreasing the number k of 
neighbours until the problem disappears. When some covariates are categorical, other 
types of distance can be used instead, as in the R package knncat of Buttrey (1998 1 . 

As such, and as also noted in Holmes and Adams (20021, the method is both deter- 
ministic, given the training dataset, and not parameterised, even though the choice of k is 
both non-trivial and relevant to the performance of the method. Usually, k is selected via 
cross-validation, as the number of neighbours that minimises the cross-validation error 
rate. In contrast to cluster-analysis set-ups, the number G of classes in the /c-nearest- 
neighbour procedure is fixed and given by the training set: the introduction of additional 
classes that are not observed in the training set has no effect on the future allocations. 




Figure 1: Training (top) and test (bottom) groups for Ripley's benchmark: the points in 
red are those for which the label is equal to 1 and the points in black are those for which 
the label is equal to 2. 



To illustrate the original method and to compare it later with our approach, we use 
throughout a toy benchmark dataset taken from Ripley ( 1994[ ). This dataset corresponds 
to a two-class classification problem in which each (sub) population of covariates is sim- 
ulated from a bivariate normal distribution, both populations being of equal sizes. A 
sample of n = 250 individuals is used as the training set and the model is tested on a 
second group of m = 1, 000 points acting as a test dataset. Figure [l] presents the datasetfj] 



1 This dataset is available at 



http:/ /www. stats. ox. ac.uk/pub/PRNN 
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and Table [T] displays the performance of the standard /c-nearest-neighbour method on the 
test dataset for several values of k. The overall misclassification leave-one-out error rate 
on the training dataset as k varies is provided in Figure [2] and it shows that this criterion 
is not very discriminating for this dataset, with little variation for a wide range of values 
of k and with several values of k achieving the same overall minimum, namely 17, 18, 35, 
36, 45, 46, 51, 52, 53 and 54. There are therefore ten different values of k in competition. 
This range of values is an indicator of potential gains when averaging over k, and hence 
calls for a Bayesian perspective. 



k Misclassification 
error rate 



1 


0.150 


3 


0.134 


15 


0.095 


17 


0.087 


31 


0.084 


54 


0.081 



Table 1: k- nearest- neighbour performances on the Ripley test dataset 
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k 



Figure 2: Misclassification leave-one-out error rate as a function of k for Ripley's training 
dataset. 
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1.3 Goal and plan 



As presented above, the /c-nearest-neighbour method is merely an allocation technique 
that does not account for uncertainty. In order to add this feature, we need to introduce 
a probabilistic framework that relates the class label y% to both the covariates Xi and the 
class labels of the neighbours of X{. Not only does this perspective provide more infor- 
mation about the variability of the classification, when compared with the point estimate 
given by the original method, but it also takes advantage of the full (Bayesian) inferen- 
tial machinery to introduce parameters that measure the strength of the influence of the 
neighbours, and to analyse the role of the variables, of the metric used, of the number k of 
neighbours, and of the number of classes towards achieving higher efficiency. Once again, 
this statistical viewpoint was previously adopted by Holmes and Adams (2002 2003) and 
we follow suit in this paper, with a modification of their original model geared towards a 
coherent probabilistic model, while providing new developments in computational model 
estimation. 

In order to illustrate the appeal of adopting a probabilistic perspective, we provide in 
Figure [3] two graphs that are by-products of our Bayesian analysis. For Ripley's dataset, 
the first graph (on the left) gives the level sets of the predictive probabilities to be in 
the black class, while the second graph (on the right) partitions the square into three 
zones, namely sure allocation to the red class, sure allocation to the black class and an 
uncertainty zone. Those three sets are obtained by first computing 95% credible intervals 
for the predictive probabilities and then checking those intervals against the borderline 
value 0.5. If the interval contains 0.5, the point is ranked as uncertain. 

The paper is organised as follows. We establish the validity of the new probabilistic k- 
nearest- neighbour model in Section [2 pointing out the deficiencies of the models advanced 
by Holmes and Adams (2002 2003), and then cover the different aspects of running 
Bayesian inference in this ^-nearest-neighbour model in Section [3] addressing in particular 
the specific issue of evaluating the normalising constant of the probabilistic /c-nearest- 
neighbour model that is necessary for inferring about k and an additional parameter. 
We take advantage of an exact MCMC approach proposed in Section 3.4 to evaluate the 
limitations of the pseudo-likelihood alternative in Section |3.5| and illustrate the method 
on several benchmark datasets in Section |4j 



2 The probabilistic /c-nearest-neighbour model 
2.1 Markov random field modelling 

In order to build a probabilistic structure that reproduces the features of the original 
fe-nearest-neighbour procedure and then to estimate its unknown parameters, we first 
need to define a joint distribution of the labels yi conditional on the covariates Xj, for 
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Figure 3: (left) Level sets of the predictive probability to be in the black class, ranging 
from high (white) to low (red), and (right) consequences of the comparison with 0.5 of 
the 95% credibility intervals for the predictive probabilities. (These plots are based on 



an MCMC sample whose derivation is explained in Section 3.4 



the training dataset. A natural approach is to take advantage of the spatial structure of 
the problem and to use a Markov random field model. Although we will show below that 
this is not possible within a coherent probabilistic setting, we could thus assume that the 
full conditional distribution of yi given y_j = (yi, . . . , yi-\, Vi+i, ■ ■ ■ , y n ) and the Xj's only 
depends on the k nearest neighbours of x% in the training set. The parameterised structure 
of this conditional distribution is obviously open but we opt for the most standard choice, 



namely, like the Potts model, a Boltzmann distribution (M0ller and Waagepetersen 2003 1 
with potential function 

where t ~^ i means that the summation is taken over the observations xg belonging 
to the k nearest neighbours of Xj, and 5 a (b) denotes the Dirac function. This function 
actually gives the number of points from the same class yi as the point Xi that are among 



the k nearest neighbours of X{. As in Holmes and Adams (2003), the expression for the 
full conditional is thus 



f(yi\y-i,X, f3,k) = exp /? Y Mw) A / E ex p p E w A w 

where (5 > and X is the (p, n) matrix {x±, . . . , x n } of coordinates for the training set. 
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In this parameterised model, (3 is a quantity that is obviously missing from the orig- 
inal /c-nearest-neighbour procedure. It is only relevant from a statistical point of view 
as a degree of uncertainty: (5 = corresponds to a uniform distribution over all classes, 
meaning independence from the neighbours, while j3 = +oo leads to a point mass distri- 
bution at the prevalent class, corresponding to extreme dependence. The introduction of 
the scale parameter k in the denominator is useful in making (3 dimensionless. 

There is, however, a difficulty with this expression in that, for almost all datasets 
X, there does not exist a joint probability distribution on y = (yi, ■ ■ ■ ,y n ) with full 
conditionals equal to ([TJ . This happens because the fe-nearest-neighbour system is usually 
asymmetric: when Xi is one of the k nearest neighbours of Xj ■ X j IS not necessarily one of 
the k nearest neighbours of Xj. Therefore, the pseudo-conditional distribution ([TJ will not 
depend on xj while the equivalent for xj does depend on xf. this is obviously impossible 
in a coherent probabilistic framework (Besag 1974 Cressie 19931 



One way of overcoming this fundamental difficulty is to follow Holmes and Adams 



(20021 and to define directly the joint distribution 




G 



J^exp 



(2) 



9=1 



Unfortunately, there are drawbacks to this approach, in that, first, the function ([2]) is not 
properly normalised (a fact overlooked by |Holmes and Adams , 2002 1 , and the necessary 



normalising constant is intractable. Second, the full conditional distributions correspond- 
ing to this joint distribution are not given by ([TJ. The first drawback is a common occur- 
rence with Boltzmann models and we will deal with this difficulty in detail in Section [3] 
At this stage, let us point out that the most standard approach to this problem is to use 



pseudo-likelihood following Besag et al. (19911, as in Heikkinen and Hogmander ( 1994 ) 



and Hoeting et al. (19991, but we will show in Section 3.5 that this approximation can 



give poor results. (See, e.g., Friel et al. (2005[) for a discussion of this point.) The second 



and more specific drawback implies that (|2j) cannot be treated as a pseudo-likelihood 
(Besag 1974 Besag et al. 1991|since, as stated above, the conditional distribution ([TJ 
cannot be associated with any joint distribution. That ([2j misses a normalising constant 
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can be seen from the special case in which n = 2, y = (yi, y^) and G = 2, since 



2/1=1 2/2=1 i=l \ I J I g=\ \ i^ ki 

2 2 / 

= E E ex p^fe(^)+^(yi)]A) / (iw 

2/1=1 J/2 = l / 

= 2 fl + e 2/3M /fi + e ^ 2 , 



which is clearly different from 1 and, more importantly, depends on both (3 and k. We 
note that the debate about whether or not one should use a proper joint distribution is 
reminiscent of the opposition between Gaussian conditional autoregressions (CAR) and 



Gaussian intrinsic autoregressions in Besag and Kooperberg (19951, the latter not being 
associated with any joint distribution. 

2.2 A symmetrised Boltzmann modelling 

Given these difficulties, we therefore adopt a different strategy and define a joint model 
on the training set as 



/(y|X,/3,A0 = exp ^EW/* / Z (^k), (3) 

where Z([3, k) is the normalising constant of the distribution. The motivation for this 
modelling is that the full conditional distributions corresponding to Q can be obtained 
as 



/(j/i|y_i, X, f3, k) oc exp 




where i ~& I means that the summation is taken over the observations xg for which x% is 
a A:-nearest neighbour. Obviously, these conditional distributions differ from ([!]) if only 
because of the impossibility result mentioned above. The additional term in the potential 
function corresponds to the observations that are not among the nearest neighbours of X{ 
but for which Xj is a nearest neighbour. In this model, compared with single neighbours, 
mutual neighbours are given a double weight. This feature is of importance in that this 
coherent model defines a new classification criterion that can be treated as a competitor 
of the standard /c-nearest-neighbour objective function. Note also that the original full 
conditional ([TJ is recovered as Q when the system of neighbours is perfectly symmetric 
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(up to a factor 2). Once again, the normalising constant Z(f3, k) is intractable, except for 
the most trivial cases. 

In the case of unbalanced sampling, that is, if the marginal probabilities p\ = ¥(y = 
1),...,PG = IP(y = G) are known and are different from the sampling probabilities 
Pi = n i/ n , ■ ■ ■ ,PG = n G/ n , where n g is the number of training observations arising 
from class g, a natural modification of this £;-nearest-neighbour model is to reweight the 
neighbourhood sizes by a g = p g n/n g . This leads to the modified model 

/(y|X,/3 J fe) = exp (fi E a Vi E 5 yi {y & ) j k\ j Z((3,k). 

This modification is useful in practice when we are dealing with stratified surveys. In the 
following, however, we assume that a g = 1 for all g = 1, . . . , G. 



2.3 Predictive perspective 

When based on the conditional expression Q, the predictive distribution of a new un- 
classified observation given its covariate x n +i and the training sample (y,X) is, for 
g = l,...,G, 



n+ i = g\x n+ i,y,X,@,k) cx exp < 



t~k(n+l) (n+l)~^ 



(5) 



where 



E 



and E S vti9) 



<~ fc (n+l) (n+l)~fc^ 
are the numbers of observations in the training dataset from class g among the k nearest 
neighbours of x n+ \ and among the observations for which x n +i is a /c-nearest neigh- 
bour, respectively. This predictive distribution can then be incorporated in the Bayesian 
inference process by considering the joint posterior of (/3,k,y n+ i) and by deriving the 
corresponding marginal posterior distribution of y n +i- 

While this model provides a sound statistical basis for the fe-nearest-neighour method- 
ology as well as a means of assessing the uncertainty of the allocations to classes of 
unclassified observations, and while it corresponds to a true, albeit unavailable, joint dis- 
tribution, it can be criticised from a Bayesian point of view in that it suffers from a lack 
of statistical coherence (in the sense that the information contained in the sample is not 
used in the most efficient way) when multiple classifications are considered. Indeed, the 
/c-nearest-neighbour methodology is invariably used in a repeated manner, either jointly 
on a sample (x n+ i, . . . , x n+m ) or sequentially. Rather than assuming simultaneously de- 
pendence in the training sample and independence in the unclassified sample, it would 
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be more sensible to consider the whole collection of points as issuing from a single joint 
model of the form given by ([3]), but with some having their class missing at random. 
Always reasoning from a Bayesian point of view, addressing jointly the inference on the 
parameters (f3, k) and on the missing classes (y n +i> • • • ,Un+m) — i.e. assuming exchange- 
ability between the training and the unclassified datapoints — certainly makes sense from a 
foundational perspective as a correct probabilistic evaluation and it does provide a better 
assessment of the uncertainty about the classifications as well as about the parameters. 

Unfortunately, this more global and arguably more coherent perspective is mostly 
unachievable if only for computational reasons, since the set of the missing class vector 
(y n +i, ■ ■ ■ , Un+m) is of size G m . It is practically impossible to derive an efficient simulation 
algorithm that would correctly approximate the joint probability distribution of both 
parameters and classes, especially when the number m of unclassified points is large. We 
will thus adopt the more ad hoc approach of dealing separately with each unclassified 
point in the analysis, because this simply is the only realistic way. This perspective can 
also be justified by the fact that, in realistic machine learning set-ups, the unclassified 
data (y n +i, ■ ■ ■ ,Vn+m) mostly occur in a sequential environment with, furthermore, the 
true value of y n +i being revealed before y n +2 is observed. 



In the following sections, we mainly consider the case G = 2 as in Holmes and Adams 



(20031, because this is the only case where we can conduct a full comparison between 
different approximation schemes, but we indicate at the end of Section 3.4 how a Gibbs 
sampling approximation allows for a realistic extension to larger values of G, as illustrated 
in Section HI 



3 Bayesian inference and the normalisation problem 

Given the joint model Q for (yi, . . . , y n +i), Bayesian inference can be conducted in a 



standard manner (Robert, 2001), provided computational difficulties related to the un- 
availability of the normalising constant can be solved. Indeed, as stressed in the previous 
section, from a Bayesian perspective, the classification of unclassified points can be based 
on the marginal predictive (or posterior) distribution of y n +i obtained by integration over 
the conditional posterior distribution of the parameters, namely, for g = 1,2, 

HVn+l = 5|z n+ i,y,X) = ^ / F (Vn+i = #kn+i,y,X,/3, k)ir((3,k\y,X.)df3 , (6) 

k J 

where 7r(/3, fc|y, X) oc /(y|X, /?, k)Tr(/3, k) is the posterior distribution of (P,k) given the 
training dataset (y, X). While other choices of prior distributions are available, we choose 
for (k, (3) a uniform prior on the compact support {1, . . . , K} x [0, /3 max ]. The limitation 
on k is imposed by the structure of the training dataset in that K is at most equal to 
the minimal class size, min(m,n2), while the limitation on 0, (3 < /3 m ax, is customary 
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in Boltzmann models, because of phase-transition phenomena (M0ller 2003 1 : when (3 
is above a certain value, the model becomes "all black or all white", i.e. all y^s are 
either equal to 1 or to 2. (This is illustrated in Figure [5] below by the convergence 
of the expectation of the number of identical neighbours to k.) The determination of 
/3 m ax is obviously problem-specific and needs to be solved afresh for each new dataset 
since it depends on the topology of the neighbourhood. It is however straighforward to 
implement in that a Gibbs simulation of ^ for different values of (3 quickly exhibits the 
"black-or- white" features. 

3.1 MCMC steps 

Were the posterior distribution tt((3, k\y, X) available (up to a normalising constant), we 
could design an MCMC algorithm that would produce a Markov chain approximating 



a sample from this posterior (Robert and Casella 2004 1, for example through a Gibbs 



sampling scheme based on the full conditional distributions of both k and j3. However, 
because of the associated representation Q, the conditional distribution of f3 is non- 
standard and we need to resort to a hybrid sampling scheme in which the exact simulation 
from 7r(/3|fc, y, X) is replaced with a single Metropolis-Hastings step. Furthermore, use of 
the full conditional distribution for k can impose fairly severe computational constraints. 
Indeed, for a given value computing the posterior /(y|X, i)"7r(/?W, i), for i = 
1, . . . ,K, requires computations of order O(KnG), once again because of the likelihood 
representation. A faster alternative is to use a hybrid step for both [3 and k: in this way, 
we only need to compute the full conditional distribution of k for one new value of k, 
modulo the normalising constant. 

An alternative to Gibbs sampling is to use a random walk Metropolis-Hastings algo- 
rithm: both (3 and k are then updated using random walk proposals. Since [3 G (0, /3 m ax) 
is constrained, we first introduce a logistic reparameterisation of (3, 

(3 = /3 max exp(#)/(exp(fl) + 1) , 

and then propose a normal random walk on the #'s, 6' ~ J\f{9 <yt \T 2 ). For k, we use 
instead a uniform proposal on the 2r neighbours of k^\ namely {k^' — r,...,k^' — 
1, k^' + 1, . . . few + r} n{l' • • • ? If}- This proposal distribution with probabiltity density 
Q r {k, •), with k' ~ Q r (k^ t ~ 1 ', •), thus depends on a parameter r e {1, . . . , K} that needs 
to be calibrated so as to aim at optimal acceptance rates, as does r 2 . The acceptance 
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probability in the Metropolis-Hastings algorithm is thus 



f(y\X,P>,k')ir((3>,k') / Qriki*- 1 )^') 



/(y|X, fc(*- 1 ))vr(/3(*- 1 ), k^- 1 )) j Q r (k', feC*" 1 )) 

exp^Q/q + exp^)) 2 
X exp(^- 1 ))/(l + exp(^- 1 ))) 2 ' 

where the second ratio is the ratio of the Jacobians due to the reparameterisation. 

Once the Metropolis-Hastings algorithm has produced a satisfactory sequence of 
(/?, k)'s, the Bayesian prediction for an unobserved class y n +i associated with x n+ \ is 
derived from Q. In fact, if we use a — 1 loss function (Robert 2001) for predicting 
y n+ i, namely 

L(y n+1 ,y n+1 ) = Iy n+1 ^ Vn+1 , 

the Bayes estimator is the most probable class g according to the posterior predictive 
Q. The associated measure of uncertainty is then the posterior expected loss, F(y n+ i = 
g\x n+1 ,y,X). 

Explicit calculation of ([6| is obviously impossible and this distribution must be ap- 
proximated from the MCMC chain k)^\ . . . , k)( M ^} simulated above, namely by 



M 



M-^P j/ n+1 = g\x n+1 ,y,X,(f3,k) 



(7) 



i=l 



Unfortunately, since ^ involves the intractable constant Z(j3, k), the above schemes 
cannot be implemented as such and we need to replace / with a more manageable target. 
We proceed below through three different approaches that try to overcome this difficulty, 
postponing the comparison till Section [3 .5 1 



3.2 Pseudo-likelihood approximation 



A first solution, dating back to Besag ( 1974), is to replace the true joint distribution with 
the pseudo-likelihood, defined as 



exp ^ p/k I Y Mw) + Mm) 

/(y|X,/3,A0 = n — 

i_1 J^exp \p/k\J2 6 9(ve) + Yl Ms) 

9=1 



(8) 
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and made up of the product of the (true) conditional distributions associated with Q. 
The true posterior distribution tt(/3, fc|y,X) is then replaced with 

n((3,k\y,X) k f(y\X,P,kM/3,k) , 

and used as such in all steps of the MCMC algorithm drafted above. The predictive 
distribution P(y n+ i = g\x n+ i, y, X) is correspondingly approximated by ([7]), based on 
the pseudo-sample thus produced. 

While this replacement of the true distribution with the pseudo-likelihood approxi- 
mation induces a bias in the estimation of (k, (3) and in the predictive performance of the 
Bayes procedure, it has been intensively used in the past, if only because of its availability 



and simplicity. For instance, Holmes and Adams (2003) built their pseudo-joint distri- 



bution on such a product (with the difficulty that the components of the product were 



not true conditionals). As noted in Friel and Pettitt (2004), pseudo-likelihood estimation 
can be very misleading and we will describe its performance in more detail in Section [3 .5 1 
(To the best of our knowledge, this Bayesian evaluation has not been conducted before.) 

As illustrated on Figure [4] for Ripley's benchmark data, the random walk Metro- 
polis-Hastings algorithm detailed above performs satisfactorily with the pseudo-likelihood 
approximation, even though the mixing is slow (cycles can be spotted on the bottom left 
graph). On that dataset, the pseudo-maximum-i.e., the maximum of Q-is achieved 
for k = 53 and (5 = 2.28. If we use the last 10,000 iterations of this MCMC run, the 
prediction performance of Q is such that the error rate on the test set of 1000 points 
is 8.7%. Figure [4] also indicates how limited the information is about k. (Note that we 
settled on the value /3 max = 4 by trial- and-error.) 



3.3 Path sampling 

A now-standard approach to the estimation of normalising constants is path sampling, 
described in|Gelman and Meng (1998) (see also Chen et al. 2000), and derived from the 



Ogata (1989) method, in which the ratio of two normalising constants, Z(f3' ,k) /Z((3,k), 
can be decomposed as an integral to be approximated by Monte Carlo techniques. 
The basic derivation of the path sampling approximation is that, if 



then 



Z(/3,A:) = ^exp[/35(y)] 
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T 1 1 1 1 ^ I 1 1 1 1 1 

10000 30000 50000 1.0 1.5 2.0 2.5 3.0 3.5 




10000 30000 50000 13 22 31 40 49 58 67 76 



Figure 4: Output of a random walk Metropolis-Hastings algorithm based on the pseudo- 
likelihood approximation of the normalising constant for 50, 000 iterations, with a 40, 000 
iteration burn-in stage, and r 2 = 0.05, r = 3. (top) sequence and marginal histogram for 
(3 when /3 max = 4 and (bottom) sequence and marginal barplot for k. 
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and 



= £S(y)exp0S(y)] 

= Z((3,k) ^2S(y)exp(f3S(y))/Z(f3,k) 
y 

= Z(J3,k)M p [S(y)]. 
Therefore, the ratio Z(/3, k)/Z(/3', k) can be derived from an integral, since 



\og{Z{p,k)/Z{p,k)}= [ E Utk [S(y)]du, 

J0 



which is easily evaluated by a numerical approximation. 

The practical drawback with this approach is that each time a new ratio is to be 
computed, that is, at each step of a hybrid Gibbs scheme or of a Metropolis-Hastings 
proposal, an approximation of the above integral needs to be produced. A further step is 
thus necessary for path sampling to be used: we approximate the function Z([3, k) only 
once for each value of k and for a few selected values of 0, and later we use numerical 
interpolation to extend the function to other values of 0. Since the function Z(/3, k) is 
very smooth, the degree of additional approximation is quite limited. Given that this 
approximation is only to be computed once, the resulting Metropolis-Hastings algorithm 
is very fast, as well as being efficient if enough care is taken with the approximation by 
checking that the slope of Z(f3, k) is sufficiently smooth from one value of /3 to the next. 
(We stress however that the computational cost required to produce those approximations 
is fairly high, because of the joint approximation in k).) 

We illustrate this approximation using Ripley's benchmark dataset. Figure [5] provides 
the approximated expectations E^fS^y)] for a range of values of f3 and for two values of 
k. Within the expectation, the y's are simulated using a systematic scan Gibbs sampler, 



because using the perfect sampling scheme elaborated below in Section 3.4 makes little 
sense when only one expectation needs to be computed. As seen from this comparative 
graph, when (5 is small, the Gibbs sampler gives good mixing performance, while, for 
larger values, it has difficulty in converging, as illustrated by the poor fit on the right- 
hand plot when k = 125. The explanation is that the model is getting closer to the 
phase-transition boundary in that case. 

For the approximation of Z(f3,k), we use the fact that E J g ! fe[<5(y)] is known when 
[3 = 0, namely E 0> fc [£?(y)] = n/2. We can thus represent log{Z(/3, k)} as 

rP 



nlog2+ f E Uifc [5(y)]dn 
Jo 
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Figure 5: Approximation of the expectation E« fe[5'(y)] for Ripley's benchmark, where 
the 0's are equally spaced between and /9 max = 4, and for k = 1 (left) and k = 125 
(right) (10 4 iterations with 500 burn-in steps for each value of ((3,k)). On these graphs, 
the black curve is based on linear interpolation of the expectation and the red curve on 
second-order spline interpolation. 



and use numerical integration to approximate the integral. As shown on Figure [6j which 
uses a bilinear interpolation based on a 50 x 12 grid of values of k), the approximated 
constant Z{(3, k) is mainly constant in k. 

Once Z(f3, k) has been approximated, we can use the genuine MCMC algorithm of 
Section |3.1| fairly easily, the main cost of this approach being thus in the approximation 
of Z(P, k). Figure [7] illustrates the output of the MCMC sampler for Ripley's benchmark, 
to be compared with Figure |4} A first item of interest is that the chain mixes much more 
rapidly (in terms of iterations) than its pseudo-likelihood counterpart. A more important 
point is that the range and shape of the approximations to both marginal posterior 



distributions differ widely between the two methods, a feature discussed in Section 3.5 



When this output of the MCMC sampler is used for prediction purposes in Q, the error 
rate for Ripley's test set is equal to 8.5%. 



3.4 Perfect sampling implementation and Gibbs approximation 

A completely different approach to handling missing normalising constants has been de- 
veloped recently by M0ller et al. (20061 and is based on an auxiliary variable idea. If we 
introduce an auxiliary variable z on the same state space as y, with arbitrary conditional 
density g(z|/3, k, y), and if we consider the joint posterior 

7r(/3,/c,z|y) oc n{/3,k, z,y) = g(z\(3,k,y) x f(y\(3,k) x n{/3,k) , 

then simulating {(5, k, z) from this posterior is equivalent to simulating (/3, k) from the 
original posterior since z integrates out. If we now run a Metropolis-Hastings algorithm 
on this augmented scheme, with q\ an arbitrary proposal density on {(5, k) and with 

q 2 (/?', k', z'l/3, k, z) = qi ((3', k'\(3, k, y)/(z'|/3', k') , 
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Figure 6: Approximation of the normalising constant Z(/3, k) for Ripley's dataset where 
the /3's are equally spaced between and /9 max = 4, and k = 1, 10, 20, . . . , 110, 125 (based 
on 10 4 Monte Carlo iterations with 500 burn-in steps, and bilinear interpolation). 




10000 30000 50000 1.0 1.2 1.4 1.6 1.8 



o 




10000 30000 50000 11 14 17 20 23 26 29 32 



Figure 7: Output of a random walk Metropolis-Hastings algorithm based on the path 
sampling approximation of the normalising constant for 50, 000 iterations, with a 40, 000 
iteration burn-in stage and r 2 = 0.05, r = 3. (top) sequence and marginal histogram for 
j3 when /3 max = 4 and (bottom) sequence and marginal barplot for k. 
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as the joint proposal on k, z) (i.e., simulating z directly from the likelihood), the 
Metropolis-Hastings ratio associated with q2 is 



Z((3,k)\ fe W (P'S(y)/k')7T(P',k')\ ( g(z'\(3', k',y 



Z(f3',k)J V exp(/3S(y)/fc)7r09,A!) ; V g(z\P,k,y. 

qi (P,k\P',k,y)ex V (pS( Z )/k) \ ( Z(P',k>) 
qi (f3',k'\P,k,y)exp((3'S(z)/k')J \ Z(0,k) 



which means that the constants Z(f3,k) and Z(f3',k') cancel out. The method of M0ller 



et al. (20061 can thus be calibrated by the choice of the artificial target g(z\/3,k,y) on 



the auxiliary variable z, and the authors advocate the choice 

g(z\P,k,y)=ex V (pS(z)/k) /Z0,k), 



as reasonable, where (j3, k) is a preliminary estimate, such as the maximum pseudo- 
likelihood estimate. While we follow this recommendation, we stress that the choice of 
(/?, k) is paramount for good performance of the algorithm, as explained below. The 
alternative of setting a target g{z\j3, k, y) that truly depends on (3 and k is appealing but 
faces computational difficulties in that the most natural proposals involve normalising 
constants that cannot be computed. 



Obviously, this approach of Moller et al. (2006) also has a major drawback, namely 



that the auxiliary variable z must be simulated from the distribution /(z|/3, k) itself. 
However, there have been many developments in the simulation of Ising models, from 
Besag (1974| to M0ller and Waagepetersen (2003), and the particular case G = 2 allows 



for exact simulation of f(z\(3, k) using perfect sampling. We refer the reader to Haggstrom 



( [2002] ), [M0ller| fl2003| |, |M0ller and Waagepetersen] fl2003| ) and |Robert and Casella| (2004, 
Chapter 13) for details of this simulation technique and for a discussion of its limitations. 
Without entering into technical details, we comment that, in the case of model ^ with 
G = 2, there also exists a monotone implementation of the Gibbs sampler that allows for 



a practical implementation of the perfect sampler (Kendall and M0ller, 2000 Berthelsen 
and M0ller 2003). More precisely, we can use a coupling- from-the-past strategy ( |Propp 



and Wilson 1998): in this setting, starting from the saturated situations in which the 



components of z are either all equal to 1 or all equal to 2, it is sufficient to monitor both 
associated chains further and further into the past until they coalesce by time 0. The 



sandwiching property of Kendall and M0ller ( 2000 ) and the monotonicity of the Gibbs 
sampler ensure that all other chains associated with arbitrary starting values for z will 
also have coalesced by then. The only difficulty with this perfect sampler is the phase- 
transition phenomenon, which means that, for very large values of (3, the convergence 
performance of the coupling from the past sampler deteriorates quite rapidly, a fact also 



noted in |M0ller et al. (2006) for the Ising model. We overcome this difficulty by using 
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Figure 8: Output of a random walk Metropolis-Hastings algorithm based on the perfect 
sampling elimination of the normalising constant for a pseudo-likelihood plug-in estimate 
{k,/3) = (53,2.28) and 20,000 iterations, with a 10,000 burn-in stage, (3 max = 4 and 
t 2 = 0.05, r = 3: (top) sequence and marginal histogram for f3 and (bottom) sequence 
and marginal barplot for k. 



an additional accept-reject step based on smaller values of j3 that avoids this explosion 
in the computational time. 

As shown on Figure]^ a poor choice for (/3, k) leads to very unsatisfactory performance 
with the algorithm. Starting from the pseudo-likelihood estimate and using this very value 
for the plug-in value (/?, k), we obtain a Markov chain with a very low energy and a very 
high rejection rate. However, use of the estimate (k,(3) = (13,1.45) resulting from this 
poor run does improve considerably the performance of the algorithm, as shown by Figure 
[9] In this setting, the predictive error rate on the test dataset is equal to 0.084. 

While this elegant solution based on an auxiliary variable completely removes the 
issue of the normalising constant, it faces several computational difficulties. First, as 
noted above, the choice of the artificial target g(z\/3,k,y) is driving the algorithm and 
plug-in estimates need to be reassessed periodicaly. Second, perfect simulation from 
the distribution f(z\/3, k) is extremely costly and may fail if j3 is close to the phase- 
transition boundary. Furthermore, the numerical value of this critical point is not known 
beforehand. Finally, the extension of the perfect sampling scheme to more than G = 2 
classes has not yet been achieved. 
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5000 10000 15000 20000 12 13 14 15 16 1.7 1 8 1 9 5000 10000 15000 20000 12 13 14 15 16 17 18 19 




t 1 1 1 r i 1 1 1 1 1 h 1 1 1 r i 1 1 1 1 1 

5000 10000 15000 20000 5 10 15 20 25 30 5000 10000 15000 20000 5 10 15 20 25 30 

Figure 9: Comparison of (left) the output of a random walk Metropolis-Hastings algo- 
rithm based on perfect sampling and of (right) the output of its Gibbs approximation 
for a plug-in estimate (k,/3) = (13,1.45) and 20,000 iterations, with a 10,000 burn-in 
stage and r 2 = 0.05, r = 3: (top) sequence and marginal histogram for and (bottom) 
sequence and marginal barplot for k. 
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For these different reasons, we advocate the substitution of a Gibbs sampler for the 
above perfect sampler in order to achieve manageable computing performance. If we 
replace the perfect sampling step with 500 (complete) iterations of the corresponding 
generic Gibbs sampler on z, the computing time is linear in the number n of observations 
and the results are virtually the same. One has to remember that the simulation of z is of 
second-order with respect to the original problem of simulating the posterior distribution 
of (/3, k), since z is an auxiliary variable introduced to overcome the computation of the 
normalising constant. Therefore, the additional uncertainty induced by the use of the 
Gibbs sampler is far from severe. Figure [9] compares the Gibbs solution with the perfect 
sampling implementation and it shows how little loss is incurred by the use of the less 
expensive Gibbs sampler, while the gain in computing time is enormous. For 50, 000 
iterations, the time required to run the Gibbs sampler is approximately 20 minutes, 
compared with more than a week for the corresponding perfect sampler (under the same 
C environment on the same machine). 



3.5 Evaluation of the pseudo-likelihood approximation 

Given that the above alternatives can all be implemented for small values of n, it is of 
direct interest to compare them in order to evaluate the effect of the pseudo-likelihood 
approximation. As demonstrated in the previous section, using Ripley's benchmark with 
a training set of 250 points, we are indeed able to run a perfect sampler over the range of 
possible /3's, and this implementation gives a sampler in which the only approximation is 
due to running an MCMC sampler (a feature common to all three versions). 

Histograms, for the same dataset, of simulated /3's, conditional or unconditional, on k 
show gross misrepresentation of the samples produced by the pseudo-likelihood approx- 



imation; see Figures 10 and 11 (The comparison for a fixed value of k was obtained 
directly by setting Ho a fixed value in all three approaches and running the correspond- 
ing MCMC algorithms.) It could of course be argued that the defect lies with the path 
sampling evaluation of the constant, but this approach strongly coincides with the per- 
fect sampling implementation, as showed on both figures. There is thus a fundamental 
discrepancy in using the pseudo-likelihood approximation; in other words, the pseudo- 
likelihood approximation defines a clearly different posterior distribution on (/?, k). 



As exhibited on Figure 10 the larger k is, the worse is this discrepancy, whereas Fig- 
ure 11 shows that both (3 and k are significantly overestimated by the pseudo-likelihood 
approximation. (It is quite natural to find such a correlation between (3 and k when we 
realise that the likelihood depends mainly on fi/k.) We can also note that the corre- 
spondence between path and perfect approximations is not absolute in the case of k, a 
difference that may be attributed to slower convergence in one or both samplers. 

In order to assess the comparative predictive properties of both approaches, we also 
provide a comparison of the class probabilities F(y = l|x,y,X) estimated at each point 
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12 3 4 



Figure 10: Comparison of the approximations to the posterior distribution of (3 based on 
the pseudo (red), the path (green) and the perfect (yellow) schemes for Ripley's bench- 
mark and k = 1, 10, 70, 125, for 20, 000 iterations and 10, 000 burn-in. 
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12 3 




Figure 11: Comparison of posterior distributions of (3 (top) and k (bottom) as represented 
in Figure [4] for the pseudo-likelihood approximation, in Figure [7] for the path sampling 
approximation and in Figure [9] for the perfect sampling approximation. 



23 



0.0 0.2 0.4 0.6 0.8 1.0 

Pseudo-likelihood approximation 



Figure 12: Comparison of the class probabilities P(y = l|x, y, X) estimated at each point 
of the testing sample. 



of the test sample. As shown by Figure 12 the predictions are quite different for values 
in the middle of the range, with no clear bias direction in using pseudo-likelihood as an 
approximation. Note that the discrepancy may be substantial and may result in a large 
number of different classifications. 



4 Illustration on real datasets 

In this Section, we illustrate the behaviour of the proposed methodology on some bench- 
mark datasets. 

We first describe the calibration of the algorithm used on each dataset. As starting 
value for the Gibbs approximation in the M0ller scheme, we use the maximum pseudo- 
likelihood estimate. The Gibbs sampler is iterated 500 times as an approximation to the 
perfect sampling step. After 10,000 iterations, we modify the plug-in estimate using the 
current average and then we run 50,000 more iterations of the algorithm. 

The first dataset is borrowed from the MASS library of R. It consists in the records of 
532 Pima Indian women who were tested by the U.S. National Institute of Diabetes and 
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Digestive and Kidney Diseases for diabetes. Seven quantitative covariates were recorded, 
along with the presence or absence of diabetes. The data are split at random into a 
training set of 200 women, including 68 diagnosed with diabetes, and a test set of the re- 
maining 332 women, including 109 diagnosed with diabetes. The performance for various 
values of k on the test dataset is given in Table [2] If we use a standard leave-one-out cross- 
validation for selecting k (using only the training dataset), then there are 10 consecutive 
values of k leading to the same error rate, namely the range 57-66. 



k Misclassification 
error rate 



1 


0.316 


3 


0.229 


15 


0.226 


31 


0.211 


57 


0.205 


66 


0.208 



Table 2: Performance of fc-nearest-neighbour methods on the Pima Indian test dataset. 



The results are provided in Figure 13 



Note that the simulated values of k tend 
to avoid the region found by the cross-validation procedure. One possible reason for 
this discrepancy is that, as noted in Section [2.2| the likelihood for our joint model is not 
directly equivalent to the fc-nearest-neighbour objective function, since mutual neighbours 
are weighted twice as heavily as single neighbours in this likelihood. Over the final 20, 000 
iterations, the prediction error is 0.209, quite in line with the /c-nearest-neighbour solution 
in Table 

To illustrate the ability of our method to consider more than two classes, we also used 



the benchmark dataset forensic glass fragments, studied in Ripley (1994). This dataset 
involves nine covariates and six classes some of which are rather rare. Following the 



recommendation made in Ripley (1994), we coalesced some classes to reduce the number 



of classes to four. We then randomly partitioned the dataset to obtain 89 individuals in 
the training dataset and 96 in the testing dataset. Leave-one-out cross-validation leads us 
to choose the value k = 17. The error rate of the 17-nearest-neighbour procedure on the 
test dataset is 0.35, whereas, using our procedure, we obtain an error rate of 0.29. The 
substantial gain from using our approach can be partially explained by the fact that the 
value of k chosen by the cross-validation procedure is much larger than those explored 
by our MCMC sampler. 
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Figure 13: Pima Indian diabetes study based on 50, 000 iterations of the Gibbs-M0ller 
sampling scheme with t 2 = 0.05, r = 3, /3 max = 4, and K = 68. 



5 Conclusions 

While the probabilistic background to a Bayesian analysis of /c-nearest-neighbour methods 



was initiated by Holmes and Adams ( 2003 ) , the present paper straightens the connection 
between the original technique and a true probabilistic model by defining a coherent 
probabilistic model on the training dataset. This new model Q then provides a sound 
setting for Bayesian inference and for evaluating not just the most likely allocations 
for the test dataset but also the uncertainty that goes with them. The advantages of 
using a probabilistic environment are clearly demonstrated: it is only within this setting 
that tools like predictive maps as in Figure [3] can be constructed. This obviously is a 
tremendous bonus for the experimenter, since boundaries between most likely classes can 
thus be estimated and regions can be established in which allocation to a specific class 
or to any class is uncertain. In addition, the probabilistic framework allows for a natural 
and integrated analysis of the number of neighbours involved in the class allocation, in 
a standard model-choice perspective. This perspective can be extended to the choice of 
the most significant components of the covariate x, even though this possibility is not 
explored in the current paper. 

The present paper also addresses the computational difficulties related to this ap- 
proach, namely the well-known issue of the intractable normalising constant. While this 



26 



has been thoroughly discussed in the literature, our comparison of three independent ap- 
proximations leads to the strong conclusion that the pseudo-likelihood approximation is 
not to be trusted for training sets of moderate size. Furthermore, while the path sampling 
and perfect sampling approximations are useful in establishing this conclusion, they can- 
not be advocated at the operational level, but we also demonstrate that a Gibbs sampling 
alternative to the perfect sampling scheme of M0ller et al. (20061 is both operational and 
practical. 
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